02:54
2026-06-24
lesswrong.com
natural-language-processing
Can You Hide From a Natural Language Autoencoder?
Researchers stress-tested Natural Language Autoencoders (NLAs) by optimizing activation vectors to flip AV explanations while preserving model behavior, achieving an 81.4% flip rate with 99.6% label p…